Introduction

This article explains how to import and process data with the annex package when the required data is available as tabular text files (e.g., CSV).

Two files are used for demonstration: demo_Bedroom.txt (contains the measurement data) and demo_Bedroom_config.TXT (contains the configuration; see article Config file).

Both files can easily be read using base R functions, namely read.table() and its wrapper functions such as read.csv() and read.delim() (see ?read.table for more details).

Reading the data

The first step is to import both (i) the measurement data (stored in raw_df) and (ii) the configuration (stored in config):

raw_df <- read.csv("demo_Bedroom.txt")
config <- read.table("demo_Bedroom_config.TXT",
                     comment.char = "#", sep = "",
                     header = TRUE, na.strings = c("NA", "empty"))
                     # see ?read.table for details

# Class and dimension of the objects
c("raw_df" = is.data.frame(raw_df), "config" = is.data.frame(config))
## raw_df config 
##   TRUE   TRUE
cbind("raw_df" = dim(raw_df), "config" = dim(config))
##      raw_df config
## [1,]  51890      8
## [2,]      8      5

Both objects are of class data.frame (plain data frames, as returned by read.csv() and read.table()) with dimensions \(51890 \times 8\) (raw_df) and \(8 \times 5\) (config), respectively.
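
As a minimal sketch of what the read.table() arguments used above do, the following example reads a small inline text instead of a file. The content of config_text is made up for illustration and is not the actual content of demo_Bedroom_config.TXT; only base R is used.

```r
# Hypothetical config content (illustration only, not the real demo file)
config_text <- c(
  "# lines starting with '#' are skipped (comment.char = \"#\")",
  "column    variable  study      home         room",
  "X         datetime  NA         NA           NA",
  "co2       CO2       DEMO_STUD  Casa_Blanca  Bed",
  "humidity  rH        empty      empty        empty"
)

# sep = "" splits on any amount of white space;
# both "NA" and "empty" are read as missing values
config_demo <- read.table(text = config_text,
                          comment.char = "#", sep = "",
                          header = TRUE, na.strings = c("NA", "empty"))

dim(config_demo)            # rows and columns of the result
is.na(config_demo$study)    # which entries were read as missing
```

The comment line is dropped before the header is parsed, so the first non-comment line provides the column names.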

The first few observations (rows) of the two objects look as follows:

head(raw_df[, 1:4], n = 3) # First four columns only
##                     X radonShortTermAvg temp humidity
## 1 2011-01-01 00:01:26               151 18.8       51
## 2 2011-01-01 00:06:25               151 18.8       51
## 3 2011-01-01 00:11:25               151 18.8       51
head(config, n = 3)
##     column variable     study        home room
## 1        X datetime      <NA>        <NA> <NA>
## 2      co2      CO2 DEMO_STUD Casa_Blanca  Bed
## 3 humidity       rH DEMO_STUD Casa_Blanca  Bed

The object raw_df contains variables (columns) such as “X”, “radonShortTermAvg”, “temp”, and “humidity”, which are the original names from the text file. The config object defines what the columns in raw_df contain and where they belong. For more details read the article about the Config file.

Checking the config object

To check whether the config object is as expected by the annex package, the function annex_check_config() can be used. If problems are detected, an error is thrown (see Config file). Otherwise, the function is silent, as in this example:

annex_check_config(config)

… no errors, the config object meets the annex requirements. Note that this step is not required, as the check is performed automatically when calling annex_prepare(), but it can be handy during development.

Preparing data

While raw_df contains the raw data set, the config object contains the information on how to rename the columns and where the observations belong. annex_prepare() is a helper function that prepares the data set for the further steps.

prepared_df <- annex_prepare(raw_df, config, quiet = TRUE)
##  [1] "datetime" "study"    "home"     "room"     "CO2"      "Light"   
##  [7] "Pressure" "Radon"    "RH"       "T"        "VOC"
## Error in annex_prepare(raw_df, config, quiet = TRUE): variable `datetime` (originally column `X`) must be of class POSIXt

This first attempt fails because the variable containing the date and time information is not a proper datetime object (object of class POSIXt) but a character vector. As the information comes in a proper ISO format, we simply convert the column (column X in raw_df) and call annex_prepare() again.

# see ?as.POSIXct for details and options
raw_df <- transform(raw_df, X = as.POSIXct(X, tz = "UTC"))
class(raw_df$X)
## [1] "POSIXct" "POSIXt"
prepared_df <- annex_prepare(raw_df, config, quiet = TRUE)
head(prepared_df)
##              datetime     study        home room CO2 Light Pressure Radon RH
## 1 2011-01-01 00:01:26 DEMO_STUD Casa_Blanca  BED 470     0   1026.5   151 51
## 2 2011-01-01 00:06:25 DEMO_STUD Casa_Blanca  BED 477     0   1026.5   151 51
## 3 2011-01-01 00:11:25 DEMO_STUD Casa_Blanca  BED 483     0   1026.5   151 51
## 4 2011-01-01 00:16:25 DEMO_STUD Casa_Blanca  BED 477     0   1026.5   151 51
## 5 2011-01-01 00:21:25 DEMO_STUD Casa_Blanca  BED 481     0   1026.4   151 51
## 6 2011-01-01 00:26:25 DEMO_STUD Casa_Blanca  BED 483     0   1026.4   168 51
##      T VOC
## 1 18.8 136
## 2 18.8 142
## 3 18.8 131
## 4 18.8 140
## 5 18.8 135
## 6 18.7 131
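
If the timestamps in a file were not in ISO format, the format would have to be passed to as.POSIXct() explicitly. The following is a minimal sketch using base R only; the vector x and its day.month.year layout are hypothetical and do not come from the demo file, and the checks at the end are a suggestion rather than something annex requires.

```r
# Hypothetical timestamps in a non-ISO layout (day.month.year)
x <- c("01.01.2011 00:01:26", "01.01.2011 00:06:25")

# Explicit format string; see ?strptime for the conversion specifications
dt <- as.POSIXct(x, format = "%d.%m.%Y %H:%M:%S", tz = "UTC")

# A failed conversion yields NA instead of an error, so check explicitly
stopifnot(inherits(dt, "POSIXt"), !anyNA(dt))

# Time between the two observations in seconds
step_secs <- as.numeric(difftime(dt[2], dt[1], units = "secs"))
```

Checking for NA values right after the conversion helps to catch rows whose timestamp did not match the given format.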

annex_prepare() performs a series of tasks:

  • checks the config object (calls annex_check_config() internally) and, if it is valid, continues with the following steps,
  • renames the variables (columns) in raw_df and checks that they are of the correct class,
  • ensures that datetime is a proper datetime object (POSIXt),
  • informs the user if there are any columns in raw_df not included in config (just a note) or additional columns defined in config which do not occur in raw_df, and
  • returns the modified (possibly subsetted) object.

The checks of missing/additional definitions in config are intended to inform the user about possible misspecifications and will not result in an error.
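
To make these tasks more concrete, the following base R sketch mimics the renaming and the two informational checks on a tiny made-up data set. It is not the implementation used by annex_prepare(); the objects raw_demo and config_demo, as well as the column battery, are hypothetical.

```r
# Hypothetical raw data: "battery" is not defined in the config
raw_demo <- data.frame(
  X        = as.POSIXct("2011-01-01 00:01:26", tz = "UTC"),
  temp     = 18.8,
  humidity = 51,
  battery  = 3.1
)

# Hypothetical config: "co2" is defined but missing in the raw data
config_demo <- data.frame(
  column   = c("X", "temp", "humidity", "co2"),
  variable = c("datetime", "T", "RH", "CO2")
)

# Informational checks (notes only, no error)
not_in_config <- setdiff(names(raw_demo), config_demo$column)
not_in_data   <- setdiff(config_demo$column, names(raw_demo))

# Keep the columns defined in the config and rename them
keep <- intersect(names(raw_demo), config_demo$column)
prepared_demo <- raw_demo[, keep, drop = FALSE]
names(prepared_demo) <- config_demo$variable[match(keep, config_demo$column)]

# The datetime column must be a proper datetime object
stopifnot(inherits(prepared_demo$datetime, "POSIXt"))
```

Here not_in_config and not_in_data hold the column names a user would be informed about, while prepared_demo is the renamed and subsetted result.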

Performing the analysis

Once the data set is prepared properly (note that annex_prepare() is a convenience function; the preparation can also be done manually), the final object can be created.

Prepare annex object

annex() is the creator function which creates an object of class annex (S3) providing a series of methods and functions to conduct the final analysis.

The function expects a formula as input which describes how to process the data. The formula has the form

<measurements to be processed> ~ <datetime> | <grouping variables>

and consists of three parts:

  • The first part defines which variables (measurements) should be processed.
  • The second part is always datetime; the date and time information used for the statistics.
  • The third part defines the grouping, typically study + home + room.

annex_df <- annex(Radon + T + VOC ~ datetime | study + home + room, data = prepared_df)
head(annex_df)
##              datetime     study        home room season   tod Radon    T VOC
## 1 2011-01-01 00:01:26 DEMO_STUD Casa_Blanca  BED  12-02 23-07   151 18.8 136
## 2 2011-01-01 00:06:25 DEMO_STUD Casa_Blanca  BED  12-02 23-07   151 18.8 142
## 3 2011-01-01 00:11:25 DEMO_STUD Casa_Blanca  BED  12-02 23-07   151 18.8 131
## 4 2011-01-01 00:16:25 DEMO_STUD Casa_Blanca  BED  12-02 23-07   151 18.8 140
## 5 2011-01-01 00:21:25 DEMO_STUD Casa_Blanca  BED  12-02 23-07   151 18.8 135
## 6 2011-01-01 00:26:25 DEMO_STUD Casa_Blanca  BED  12-02 23-07   168 18.7 131
class(annex_df)
## [1] "annex"      "data.frame"
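
The three parts of such a formula can be inspected with base R alone. The sketch below only illustrates how R parses the formula; it makes no claim about how annex() processes it internally, and the object names (f, measurements, datetime_var, grouping) are chosen for illustration.

```r
# A formula is not evaluated, so no data set is needed to inspect it
f <- Radon + T + VOC ~ datetime | study + home + room

# Left-hand side: the measurements to be processed
measurements <- all.vars(f[[2]])

# Right-hand side is a call to `|`: datetime | grouping
rhs <- f[[3]]
datetime_var <- all.vars(rhs[[2]])
grouping     <- all.vars(rhs[[3]])

measurements
datetime_var
grouping
```

Because `|` binds more tightly than `~` but less tightly than `+`, the right-hand side splits cleanly into the datetime part and the grouping part.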

A series of S3 methods exists for annex objects; this set might be extended in the future.

Performing analysis

Based on the object returned by annex() the analysis can be performed by calling annex_stats(). The function aggregates the data based on the formula provided above, calculates a series of statistical properties, and returns an object of class annex_stats.

stats <- annex_stats(annex_df, format = "long")
head(stats)
##       study        home room season tod variable stats      value
## 1 DEMO_STUD Casa_Blanca  BED  12-02 all    Radon  Mean   190.7551
## 2 DEMO_STUD Casa_Blanca  BED  12-02 all    Radon    Sd    60.8227
## 3 DEMO_STUD Casa_Blanca  BED  12-02 all    Radon     N 25185.0000
## 4 DEMO_STUD Casa_Blanca  BED  12-02 all    Radon   NAs     0.0000
## 5 DEMO_STUD Casa_Blanca  BED  12-02 all    Radon   p00    57.0000
## 6 DEMO_STUD Casa_Blanca  BED  12-02 all    Radon p00.5    69.0000

By default, the argument format is set to "wide", which returns the statistics in a wide format. The choice does not matter for the further analysis but can be convenient when the statistics are processed manually.
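
The difference between the two layouts can be sketched with base R on a toy data set. The values below are made up, only three statistics are computed, and the column names of the wide result are those produced by reshape(); the actual layout returned by annex_stats() may differ.

```r
# Toy data (made up for illustration, not taken from the demo file)
demo <- data.frame(
  room  = c("BED", "BED", "BED", "LIV", "LIV"),
  Radon = c(151, 151, 168, 90, NA)
)

# Long format: one row per group and statistic
long_demo <- do.call(rbind, lapply(split(demo, demo$room), function(d) {
  data.frame(
    room  = d$room[1],
    stats = c("Mean", "N", "NAs"),
    value = c(mean(d$Radon, na.rm = TRUE),
              sum(!is.na(d$Radon)),   # assumption: N counts non-missing values
              sum(is.na(d$Radon)))
  )
}))
rownames(long_demo) <- NULL

# Wide format: one row per group, one column per statistic
wide_demo <- reshape(long_demo, idvar = "room", timevar = "stats",
                     direction = "wide")

long_demo
wide_demo
```

The long layout is easy to filter by statistic, whereas the wide layout places all statistics of one group side by side.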

Writing output file

TODO(R): Only include this in the write_and_validate.html article?

The final step is to write the data into the final standardized file format. annex_write_stats() takes an annex_stats object (returned by annex_stats(); long or wide format) and a file name/path where the data should be stored.

In addition, a user (integer; user ID provided by the project team) must be provided.

TODO(R): Currently no update method is available.

annex_write_stats(stats, file = "final_Bedroom.xlsx", user = 123, quiet = TRUE)

This creates the file final_Bedroom.xlsx when successful.

Next steps

After performing the data preparation and calculating the statistics, the following steps can be performed: